Winter Olympics Medals over Time

Scenario

Imagine you are the data scientist at a respected media outlet – say the “New York Times”. For the Winter Olympics coverage, your editor-in-chief asks you to analyze some data on the history of Winter Olympics Medals by Year, Country, Event and Gender and prepare some data visualizations in which you outline the main patterns around which to base the story.

Since there is no way that all features of the data can be represented in such a memo, feel free to pick and choose some patterns that would make for a good story – outlining important patterns and presenting them in a visually pleasing way.

The full background and text of the story will be researched by a writer of the magazine – your input should be based on the data and some common sense (i.e. no need to read up on this).

Provide polished plots that are refined enough to include in the magazine with very little further manipulation (already include variable descriptions [if necessary for understanding], titles, source [e.g. “International Olympic Committee”], right color etc.) and are understandable to the average reader of the “New York Times”. The design does not need to be NYTimes-like. Just be consistent.

Data

The main data is provided as an excel sheet, containing the following variables on all participating athletes in all olympics from 1896 to 2016 (sadly, the original source of the data no longer updates beyond that year):

  • ID: a unique indentifier of the entry
  • Name: name of the athlete
  • Sex: sex of the athlete
  • Age: age of the athlete
  • Height: height of the athlete
  • Weight: weight of the athlete
  • Team: usually the country team of the athlete, with the exception of political accomodations, e.g. the “Refugee Olympic Athletes” team.
  • NOC: national olympic comittee abbreviation.
  • Games: year and season of games.
  • Year: year of games
  • Season: season of games.
  • City: host city
  • Sport: a grouping of disciplines
  • Event: the particular event / competition
  • Medal: the particular event / competition

For example, an event is a competition in a sport or discipline that gives rise to a ranking. Thus Alpine Skiing is the discipline, and Alpine Skiing Women's Downhills is a particular event.

In addition, you are provided with some additional information about the countries in a separate spreadsheet, including the IOC Country Code, Population, and GDP per capita.

Tasks

1. Medal Counts over Time

  1. Combine the information in the three spreadsheets athletes_and_events.csv, noc_regions.csv, and gdp_pop.csv. Note, that the noc_regions.csv is the set all NOC regions, while gdp_pop.csv only contains a snapshot of the current set of countries. You have to decide what to do with some countries that competed under different designations in the past (e.g. Germany and Russia) and some defunct countries and whether and how to combine their totals. Make sure to be clear about your decisions here, so that the editor (and potentially a user of your visualizations) understands what you did.
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.6     ✓ purrr   0.3.4
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
aaee <- read.csv("/Users/yanlinzhang/Desktop/Data Visualization/assignment-1---winter-olympics-calvinzyl/data/athletes_and_events.csv")
gdp <- read.csv("/Users/yanlinzhang/Desktop/Data Visualization/assignment-1---winter-olympics-calvinzyl/data/gdp_pop.csv")
noc <- read.csv("/Users/yanlinzhang/Desktop/Data Visualization/assignment-1---winter-olympics-calvinzyl/data/noc_regions.csv")
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# One-hot encode gold, silver, and bronze 
dum <- aaee[ , c("ID", "Medal")]
dummy <- dummyVars(" ~ .", data=dum)
dummy <- data.frame(predict(dummy, newdata=dum))
dummy[is.na(dummy)] <- 0

# Replace strings in Medal with counts (1's)
aae2 <- aaee
aae2["Medal"][aae2["Medal"] == "Gold"] <- 1
aae2["Medal"][aae2["Medal"] == "Silver"] <- 1
aae2["Medal"][aae2["Medal"] == "Bronze"] <- 1
aae2[is.na(aae2)] <- 0

ouch <- data.frame(dummy, aae2)

# Drop all rows from Summer Olympics
aae3 <- ouch[!(ouch$Season=="Summer"),] 

# Ensure team events are counted as a single medal
aae <- distinct(aae3, NOC, Event, Year, Medal, MedalGold, MedalBronze, MedalSilver)

# Compute total winter games each country competed in
games <- aae3 %>% group_by(NOC) %>% 
  summarise(Winter_Games=n_distinct(Games))

# Compute total medals each country won
medals <- aae %>% group_by(NOC) %>%
            summarise(Medals=sum(as.numeric(Medal)),
                      .groups='drop')
golds <- aae %>% group_by(NOC) %>%
            summarise(Golds=sum(as.numeric(MedalGold)),
                      .groups='drop')
silvers <- aae %>% group_by(NOC) %>%
            summarise(Silvers=sum(as.numeric(MedalSilver)),
                      .groups='drop')
bronzes <- aae %>% group_by(NOC) %>%
            summarise(Bronzes=sum(as.numeric(MedalBronze)),
                      .groups='drop')

# Merge games, medals, and the noc dataframe
small_merge1 <- data.frame(games, medals, golds, silvers, bronzes)
small_merge2 <- inner_join(small_merge1, noc, by="NOC")
small_merge2 <- small_merge2[order(small_merge2$region, decreasing=FALSE),]

# Replace past or defunct countries with their current code, if applicable
small_merge2$NOC[small_merge2$NOC == "HKG"] <- "CHN"
small_merge2$NOC[small_merge2$NOC == "TCH"] <- "CZE"
small_merge2$NOC[small_merge2$NOC == "GDR"] <- "GER"
small_merge2$NOC[small_merge2$NOC == "FRG"] <- "GER"
small_merge2$NOC[small_merge2$NOC == "EUN"] <- "RUS"
small_merge2$NOC[small_merge2$NOC == "URS"] <- "RUS"
small_merge2$NOC[small_merge2$NOC == "SCG"] <- "SRB"
small_merge2$NOC[small_merge2$NOC == "YUG"] <- "SRB"

# Sum up total games and medals for countries with multiple designations
small_merge3 <- small_merge2 %>% 
  group_by(NOC) %>% 
  summarize(Winter_Games=sum(Winter_Games), Medals=sum(Medals), Golds=sum(Golds), Silver=sum(Silvers), Bronze=sum(Bronzes))

gdp <- gdp %>% 
  rename(NOC = Code)

Combined <- dplyr::inner_join(small_merge3, gdp, by="NOC") 
print(Combined)
## # A tibble: 106 × 9
##    NOC   Winter_Games Medals Golds Silver Bronze Country              Population
##    <chr>        <int>  <dbl> <dbl>  <dbl>  <dbl> <chr>                     <int>
##  1 AHO              2      0     0      0      0 Netherlands Antille…         NA
##  2 ALB              3      0     0      0      0 Albania                 2889167
##  3 ALG              3      0     0      0      0 Algeria                39666519
##  4 AND             11      0     0      0      0 Andorra                   70473
##  5 ARG             18      0     0      0      0 Argentina              43416755
##  6 ARM              6      0     0      0      0 Armenia                 3017712
##  7 ASA              1      0     0      0      0 American Samoa*           55538
##  8 AUS             19     13     6      3      4 Australia              23781169
##  9 AUT             22    218    59     78     81 Austria                 8611088
## 10 AZE              5      0     0      0      0 Azerbaijan              9651349
## # … with 96 more rows, and 1 more variable: GDP.per.Capita <dbl>

For countries with different designations in the past, I added up all the games and medals from those designations since I want align the NOCs with current set of countries but also take into account their past achievements under other designations. For example, I added medals and games from United Team and Soviet Union to those from Russia. Same rule applies to United Team of Germany and East Germany. Those governments of the same country are not strictly administratively identical, but I append their medal and game counts for the purpose of an up-to-date summary.

  1. Calculate a summary of how many winter games each country competed in, and how many medals of each type the country won. Use that summary to provide a visual comparison of medal count by country.
# Data frame summary of how many winter games each country competed in and how many medals of total medal and each type the country won
summary_games_by_country <- Combined[,c("NOC","Winter_Games", "Medals","Golds", "Silver","Bronze")]
print(summary_games_by_country)
## # A tibble: 106 × 6
##    NOC   Winter_Games Medals Golds Silver Bronze
##    <chr>        <int>  <dbl> <dbl>  <dbl>  <dbl>
##  1 AHO              2      0     0      0      0
##  2 ALB              3      0     0      0      0
##  3 ALG              3      0     0      0      0
##  4 AND             11      0     0      0      0
##  5 ARG             18      0     0      0      0
##  6 ARM              6      0     0      0      0
##  7 ASA              1      0     0      0      0
##  8 AUS             19     13     6      3      4
##  9 AUT             22    218    59     78     81
## 10 AZE              5      0     0      0      0
## # … with 96 more rows

Feel free to focus on smaller set of countries (say the top 10), highlight the United States or another country of your choice, consider gender of the medal winners etc. to make the visualization interesting.

Please provide (i) one visualization showing an over time comparison and (ii) one visualization in which a total medal count (across all Winter Olympics) is used. Briefly discuss which visualization you recommend to your editor and why.

Note: Currently, the medal data contains information on each athlete competing, including for team events. For example, in 2014 Russia received 4 gold medals for their men’s win in Bobsleigh Men’s Four alone. Since this is usually not how it is done in official medal statistics, try to wrangle the data so that team events are counted as a single medal.

# Sort the combined data frame in an decreasing order based on medal counts
sorted_medal <- Combined[order(Combined$Medals, decreasing=TRUE),]
sorted_medal_graph <- head(sorted_medal, 10)

sorted_medal_graph <- sorted_medal_graph[order(sorted_medal_graph$Medals, decreasing=TRUE),]

# Create bar chart to show top 10 countries with most medals
medal_count_winter <- ggplot(sorted_medal_graph, aes(x=reorder(NOC, -Medals), y=Medals)) + 
  geom_bar(stat="identity",width=0.7) + 
  labs(y="Total Number of Medals", x="National Olympic Committee") + 
  ggtitle("Top 10 Countries with Most Winter Olympic Medals of All Time")

# Create bar chart to show top 10 countries with most golds
gold_count_winter <- ggplot(sorted_medal_graph, aes(x=reorder(NOC, -Golds), y=Golds)) + 
  geom_bar(stat="identity",width=0.7) + 
  labs(y="Total Number of Golds") +
  theme(axis.text.x = element_text(size=5),
        axis.title.x = element_blank()) +
  coord_cartesian(ylim = c(0, 140))

# Create bar chart to show top 10 countries with most silvers
silver_count_winter <- ggplot(sorted_medal_graph, aes(x=reorder(NOC, -Silver), y=Silver)) + 
  geom_bar(stat="identity",width=0.7) + 
  labs(y="Total Number of Silvers") +
  theme(axis.text.x = element_text(size=5),
        axis.title.x = element_blank()) +
  coord_cartesian(ylim = c(0, 140))

# Create bar chart to show top 10 countries with most bronzes
bronze_count_winter <- ggplot(sorted_medal_graph, aes(x=reorder(NOC, -Bronze), y=Bronze)) + 
  geom_bar(stat="identity",width=0.7) + 
  labs(y="Total Number of Bronzes") +
  theme(axis.text.x = element_text(size=5),
        axis.title.x = element_blank()) +
  coord_cartesian(ylim = c(0, 140))

library(patchwork)

# Align four graphs
medal_count_winter / (gold_count_winter + silver_count_winter + bronze_count_winter)

# Remove the information that does not belong to selected top 10 countries
rus <- filter(aae, NOC %in% c("RUS"))
usa <- filter(aae, NOC %in% c("USA"))
ger <- filter(aae, NOC %in% c("GER"))
can <- filter(aae, NOC %in% c("CAN"))
nor <- filter(aae, NOC %in% c("NOR"))
swe <- filter(aae, NOC %in% c("SWE"))
fin <- filter(aae, NOC %in% c("FIN"))
aut <- filter(aae, NOC %in% c("AUT"))
sui <- filter(aae, NOC %in% c("SUI"))
cze <- filter(aae, NOC %in% c("CZE"))

# For each country, group by year and summarize medal counts to prepare for over time trend visualizations
rus_trend <- rus %>%
  group_by(Year) %>%
  summarise(Medals=sum(as.numeric(Medal)),
            .groups='drop') %>% 
  mutate(NOC="RUS",
         .after="Medals")

usa_trend <- usa %>%
  group_by(Year) %>%
  summarise(Medals=sum(as.numeric(Medal)),
            .groups='drop') %>% 
  mutate(NOC="USA",
         .after="Medals")

ger_trend <- ger %>%
  group_by(Year) %>%
  summarise(Medals=sum(as.numeric(Medal)),
            .groups='drop') %>% 
  mutate(NOC="GER",
         .after="Medals")

can_trend <- can %>%
  group_by(Year) %>%
  summarise(Medals=sum(as.numeric(Medal)),
            .groups='drop') %>% 
  mutate(NOC="CAN",
         .after="Medals")

nor_trend <- nor %>%
  group_by(Year) %>%
  summarise(Medals=sum(as.numeric(Medal)),
            .groups='drop') %>% 
  mutate(NOC="NOR",
         .after="Medals")

swe_trend <- swe %>%
  group_by(Year) %>%
  summarise(Medals=sum(as.numeric(Medal)),
            .groups='drop') %>% 
  mutate(NOC="SWE",
         .after="Medals")

fin_trend <- fin %>%
  group_by(Year) %>%
  summarise(Medals=sum(as.numeric(Medal)),
            .groups='drop') %>% 
  mutate(NOC="FIN",
         .after="Medals")

aut_trend <- aut %>%
  group_by(Year) %>%
  summarise(Medals=sum(as.numeric(Medal)),
            .groups='drop') %>% 
  mutate(NOC="AUT",
         .after="Medals")

sui_trend <- sui %>%
  group_by(Year) %>%
  summarise(Medals=sum(as.numeric(Medal)),
            .groups='drop') %>% 
  mutate(NOC="SUI",
             .after="Medals")

cze_trend <- cze %>%
  group_by(Year) %>%
  summarise(Medals=sum(as.numeric(Medal)),
            .groups='drop') %>% 
  mutate(NOC="CZE",
          .after="Medals")

plot <- rbind(rus_trend, usa_trend, ger_trend, can_trend, nor_trend,
              swe_trend, fin_trend, aut_trend, sui_trend, cze_trend)

library("ggthemes")

# Create a line plot with top 10 countries' medal counts over time
medal_trend <- ggplot(data = plot, aes(x=Year, y=Medals, group=NOC)) +
  theme_clean() +
  geom_line(aes(color=NOC)) +
  labs(x="National Olympic Committee") + 
  ggtitle("Top 10 Countries with Most Winter Olympic Medals over time")

print(medal_trend)

Between the over time medal comparison and medal count by countries, I prefer the latter (bar chart combination). This is because a basic bar chart is in my opinion the most suitable graphical representation of comparison. With comparisons of each type of medals in addition, viewers are able to see the medal breakdown of each country as well, although a stacked bar chart in this case would be a more concise choice. On the contrary, in terms of static graphics, the time-series plot above looks messy, making viewers hard to distinguish between different countries. Moreover, almost all countries were absent in some of the Winter Olympics, which makes the over time trends less accurate.

2. Medal Counts adjusted by Population, GDP

There are different ways to calculate “success”. Consider the following variants and choose one (and make sure your choice is clear in the visualization):
- Just consider gold medals.
- Simply add up the number of medals of different types.
- Create an index in which medals are valued differently. (gold=3, silver=2, bronze=1).
- A reasonable other way that you prefer.

Now, adjust the ranking of medal success by (a) GDP per capita and (b) population. You have now three rankings: unadjusted ranking, adjusted by GDP per capita, and adjusted by population.

Visualize how these rankings differ. Try to highlight a specific pattern (e.g. “South Korea – specialization reaps benefits” or “The superpowers losing their grip”).

unadjusted_ranking <- ggplot(sorted_medal_graph, aes(x=reorder(NOC, -Medals), y=Medals)) + 
  geom_bar(stat="identity",width=0.7) + 
  theme(axis.title.y = element_text(size = 8)) +
  labs(y="Total Number of Medals", x="National Olympic Committee") + 
  ggtitle("Unadjusted Ranking")

# Compute adjusted rankings
Combined$MedalsbyPop <- Combined$Medals/Combined$Population
Combined$MedalsbyGDP <- Combined$Medals/Combined$GDP.per.Capita

sorted_pop<- Combined[order(Combined$MedalsbyPop, decreasing=TRUE),]
sorted_pop_graph <- head(sorted_pop, 10)

sorted_gdp<- Combined[order(Combined$MedalsbyGDP, decreasing=TRUE),]
sorted_gdp_graph <- head(sorted_gdp, 10)

adjusted_ranking1 <- ggplot(sorted_pop_graph, aes(x=reorder(NOC, -MedalsbyPop), y=MedalsbyPop)) + 
  geom_bar(stat="identity",width=0.7) + 
  theme(axis.title.y = element_text(size = 8)) +
  labs(y="Medal Counts adjusted by Population", x="National Olympic Committee") + 
  ggtitle("Population-Adjusted Ranking")

adjusted_ranking2 <- ggplot(sorted_gdp_graph, aes(x=reorder(NOC, -MedalsbyGDP), y=MedalsbyGDP)) + 
  geom_bar(stat="identity",width=0.7) +
  theme(axis.title.y = element_text(size = 8)) +
  labs(y="Medal Counts adjusted by GDP", x="National Olympic Committee") + 
  ggtitle("GDP-Adjusted Ranking")

library("cowplot")
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggthemes':
## 
##     theme_map
## The following object is masked from 'package:patchwork':
## 
##     align_plots
# Align three bar charts
cowplot::plot_grid(unadjusted_ranking, adjusted_ranking1, adjusted_ranking2)

3. Host Country Advantage

Until the 2014 Sochi Winter Olympics (our data for Winter Olympics end here), there were 19 host cities. Calculate whether the host nation had an advantage. That is calculate whether the host country did win more medals when the Winter Olympics was in their country compared to other times.

Note, that the 19 host cities are noted in the data but not the countries they are located in. This happens commonly and often Wikipedia has the kind of additional data you want for the task. To save you some time, here is a quick way to get this kind of table from Wikipedia into R:

library(rvest)
library(stringr)
library(tidyverse)
wiki_hosts <- read_html("https://en.wikipedia.org/wiki/List_of_Olympic_Games_host_cities")
hosts <- html_table(html_nodes(wiki_hosts, "table")[[2]], fill=TRUE)[-1]
hosts %>% filter(Winter != "") %>%
  select(City, Country, Year)

Provide a visualization of the host country advantage (or absence thereof).

# For each country, annotate specific years with host cities
usa_host <- usa_trend %>%
  mutate(Host="None") %>% 
  mutate(City="")
usa_host$Host[usa_host$Year == 1960 | usa_host$Year == 1932 | usa_host$Year == 1980 | usa_host$Year == 2002] <- "Host"
usa_host$City[usa_host$Year == 1932] <- "1932\nLake Placid"
usa_host$City[usa_host$Year == 1960] <- "1960\nSquaw Valley"
usa_host$City[usa_host$Year == 1980] <- "1980\nLake Placid"
usa_host$City[usa_host$Year == 2002] <- "2002\nSalt Lake City"

aut_host <- aut_trend %>%
  mutate(Host="None") %>% 
  mutate(City="")
aut_host$Host[aut_host$Year == 1964 | aut_host$Year == 1976 ] <- "Host"
aut_host$City[aut_host$Year == 1964] <- "1964\nInnsbruck"
aut_host$City[aut_host$Year == 1976] <- "1976\nInnsbruck"

sui_host <- sui_trend %>%
  mutate(Host="None") %>% 
  mutate(City="")
sui_host$Host[sui_host$Year == 1928 | sui_host$Year == 1948 ] <- "Host"
sui_host$City[sui_host$Year == 1928] <- "1928\nSt. Moritz"
sui_host$City[sui_host$Year == 1948] <- "1948\nSt. Moritz"

can_host <- can_trend %>%
  mutate(Host="None") %>% 
  mutate(City="")
can_host$Host[can_host$Year == 1988 | can_host$Year == 2010 ] <- "Host"
can_host$City[can_host$Year == 1988] <- "1988\nCalgary"
can_host$City[can_host$Year == 2010] <- "2010\nVancouver"
# Use minimal bar plots to show four countries' over time medal counts and highlight the years of hosts
usa_host_graph <- ggplot(data=usa_host, aes(x=Year, y=Medals, color=Host)) +
  scale_color_grey() +
  theme_classic() +
  theme_tufte(base_size=13, ticks=TRUE) +
  geom_bar(width=1.7, fill="gray", stat = "identity") +
  geom_hline(yintercept=seq(0, 40, 10), col="white", lwd=1) +
  ggtitle("US Host Advantage?") +
  labs(y="Medal Counts") +
  theme(plot.title = element_text(size = 15, hjust = 0.5),
        axis.title.x=element_blank(),
        axis.title.y=element_text(size = 11),
        legend.position="none") +
  geom_text(size=2.7, data=usa_host, aes(label=City, vjust= -0.2))

aut_host_graph <- ggplot(data=aut_host, aes(x=Year, y=Medals, color=Host)) +
  scale_color_grey() +
  theme_classic() +
  theme_tufte(base_size=13, ticks=TRUE) +
  geom_bar(width=1.7, fill="gray", stat = "identity") +
  geom_hline(yintercept=seq(0, 20, 5), col="white", lwd=1) +
  ggtitle("Austria Host Advantage?") +
  labs(y="Medal Counts") +
  theme(plot.title = element_text(size = 15, hjust = 0.5),
        axis.title.x=element_blank(),
        axis.title.y=element_text(size = 11),
        legend.position="none") +
  geom_text(size=2.7, data=aut_host, aes(label=City, vjust= -0.2))

sui_host_graph <- ggplot(data=sui_host, aes(x=Year, y=Medals, color=Host)) +
  scale_color_grey() +
  theme_classic() +
  theme_tufte(base_size=13, ticks=TRUE) +
  geom_bar(width=1.7, fill="gray", stat = "identity") +
  geom_hline(yintercept=seq(0, 15, 5), col="white", lwd=1) +
  ggtitle("Switzerland Host Advantage?") +
  labs(y="Medal Counts") +
  theme(plot.title = element_text(size = 15, hjust = 0.5),
        axis.title.x=element_blank(),
        axis.title.y=element_text(size = 11),
        legend.position="none") +
  geom_text(size=2.7, data=sui_host, aes(label=City, vjust= -0.2))

can_host_graph <- ggplot(data=can_host, aes(x=Year, y=Medals, color=Host)) +
  scale_color_grey() +
  theme_classic() +
  theme_tufte(base_size=13, ticks=TRUE) +
  geom_bar(width=1.7, fill="gray", stat = "identity") +
  geom_hline(yintercept=seq(0, 30, 5), col="white", lwd=1) +
  ggtitle("Canada Host Advantage?") +
  labs(y="Medal Counts") +
  theme(plot.title = element_text(size = 15, hjust = 0.5),
        axis.title.x=element_blank(),
        axis.title.y=element_text(size = 11),
        legend.position="none") +
  geom_text(size=2.7, data=can_host, aes(label=City, vjust= -0.2))

library("patchwork")
(usa_host_graph + can_host_graph) / (sui_host_graph + aut_host_graph)

I randomly picked four countries to explore the host country advantage. One certainly include a lot more to show more convincing patterns, but due to time limit I only did four. It seems like the host advantange is generally true, where I define as the host country obtaining conspicuously more medals than it did in adjacent Olympic games. There is apparently one exception which is Switzerland’s 1928 St. Moritz Olympics, and I would say it’s reasonable especially in the very early stage of Winter Olympics where everything was not well developed yet.

4. Most successful athletes

  1. Now, let’s look at the most successful athletes. Provide a visual display of the most successful Winte Olympics athletes of all time.
aae3$Sex[aae3$Sex == "M"] <- 1
aae3$Sex[aae3$Sex == "F"] <- 0

# Group total medals, events, and gender by athlete names
athlete <- aae3 %>%
  mutate(Event_ct=1) %>% 
  group_by(Name) %>%
  summarise(Medals=sum(as.numeric(Medal)),
            Events=sum(Event_ct),
            Gender=mean(as.numeric(Sex)),
            .groups='drop') %>% 
  arrange(desc(Medals))

athlete$Gender[athlete$Gender == 1] <- "Male"
athlete$Gender[athlete$Gender == 0] <- "Female"

sorted_athlete <- head(athlete, 300) 

pointsToLabel <- c("Ole Einar Bjrndalen", "Raisa Petrovna Smetanina",
                   "Stefania Belmondo", "Yang Yang","Lyubov Ivanovna Yegorova",
                   "Kjetil Andr Aamodt")

# Visualize the relationship between events and medals among the most successful athletes
suc_ath <- ggplot(data=sorted_athlete,
                  aes(x=Medals, y=Events, color=Gender)) +
  scale_color_grey() +
  theme_gdocs() +
  geom_point(alpha=0.8, size = 3) +
  labs(y="Events", x="Medals") + 
  ggtitle("Most Successful Athletes of All Time") +
  geom_text(aes(label=Name, hjust=0.9, vjust=1.5),
            color = "gray20",
            data = filter(sorted_athlete, Name %in% pointsToLabel)) +
  theme(plot.title = element_text(size = 18, hjust = 0.5),
        axis.title.x=element_text(size = 13),
        axis.title.y=element_text(size = 13))

print(suc_ath)

This scatter plot is trying to show a few athletes with most number of medals obtained and events attended. There are, not surprisingly, some big names appearing in the graph such as Ole Bjrndalen, Yang Yang, Raisa Semtanina, and so on. I also grouped the points by gender, which does not show anything outstanding. Note that the plot is static, so there isn’t much room to illustrate whom each point corresponds to, and that’s the reason why I used ggplotly in Part 5 to add interactive labels of a similar plot, which is way better than this one.

  1. Chose of of the athlete specific dimensions (e.g. gender, height, weight) and visualize an interesting pattern in the data.
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ggpubr)
## 
## Attaching package: 'ggpubr'
## The following object is masked from 'package:cowplot':
## 
##     get_legend
aae3 <- aae3 %>% 
  filter(Height != "0") %>% 
  filter(Weight != "0")

# Filter out speed skating and ice hockey athlete dimensions for later comparison
sk <- aae3 %>% 
  filter(Sport == "Speed Skating") 

ih <- aae3 %>% 
  filter(Sport == "Ice Hockey")

# Visualize the relationship between weight and height in speed skating as well as ice hockey
sk_graph <- ggplot(data=sk,
                  aes(x=Weight, y=Height)) +
  geom_point(size=1, alpha=0.5) +
  scale_color_grey() +
  coord_cartesian(ylim = c(140, 210), xlim = c(40, 120)) +
  ggtitle("Speed Skating") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth() +
  stat_cor(label.y=205) +
  stat_regline_equation(label.y=200)

ih_graph <- ggplot(data=ih,
                  aes(x=Weight, y=Height)) +
  geom_point(size=1, alpha=0.5) +
  scale_color_grey() +
  coord_cartesian(ylim = c(140, 210), xlim = c(40, 120)) +
  ggtitle("Ice Hockey") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth() +
  stat_cor(label.y=205) +
  stat_regline_equation(label.y=200)

grid.arrange(sk_graph, ih_graph, ncol=2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

In here I did a simple camparison of athlete dimensions between speed skating and ice hockey. The visualization shows that the range of height does not really differ, but athletes in Ice Hockey have generally higher weights, which is fairly intuitive.

Interactivity

5. Make two plots interactive

Choose 2 of the plots you created above and add interactivity. One of the plots needs to be written in plotly rather than just using the ggplotly automation. Briefly describe to the editor why interactivity in these visualization is particularly helpful for a reader.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Use plotly to create an interactive over time medal count comparison between countries
plot_ly(plot, x=~Year, y=~Medals, color=~NOC,
        type="scatter",
        mode="lines+markers",
        line =list(width=3, dash="dot")) %>% 
  layout(title="Top 10 Countries with Most Winter Olympic Medals over time",
         hovermode = "x unified")
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

The first interactive plot is built upon the time-series plot of over time medal comparison between countries in Part 1. I used the line-marker combo to make the curves more eye-catching, if not more aesthetically appealing, and added the x-unified hover mode to enable comparisons in specific cross-sections of time. The idea of interactivity fits this plot well because it’s hard to eyeball any detailed patterns in the static version but not in the dynamic one.

sorted_athlete2 <- head(athlete, 500) 

# Use ggplotly to automate the successful athlete static scatter plot
suc_ath_inter <- ggplot(data=sorted_athlete2,
                  aes(x=Medals, y=Events, color=Gender, label=Name)) +
  scale_color_grey() +
  theme_gdocs() +
  geom_point(size = 2) +
  labs(y="Events", x="Medals") + 
  ggtitle("Most Successful Athletes of All Time") +
  theme(plot.title = element_text(size = 18, hjust = 0.5),
        axis.title.x=element_text(size = 13),
        axis.title.y=element_text(size = 13))

ggplotly(suc_ath_inter)

Note that in this ggplotly “upgrade” I incorporated more athletes (500 over 300) because the specific information like athlete names won’t show up until someone selects a specific point. Therefore, the general view is a lot more clear than the static version when I don’t have to manually label those points.

6. Data Table

Prepare a selected data set and add a datatable to the output. Make sure the columns are clearly labelled. Select the appropriate options for the data table (e.g. search bar, sorting, column filters etc.). Suggest to the editor which kind of information you would like to provide in a data table in the online version of the article and why.

library(DT)

ath_counts <- aae3 %>% 
  count(Games, Name) %>% 
  mutate("n"=1) %>% 
  count(Games) %>% 
  rename(ath_counts = n)

avg_height <- aae3 %>%
  filter(Height != '0') %>% 
  group_by(Games) %>% 
  summarise(Avg_Height=mean(Height),
            .groups='drop')

avg_weight <- aae3 %>%
  filter(Weight != '0') %>% 
  group_by(Games) %>% 
  summarise(Avg_Weight=mean(Weight),
            .groups='drop')

medal_counts <- aae3 %>% 
  group_by(Games) %>% 
  summarise(Medals=sum(as.numeric(Medal)),
            .groups='drop')

gold_counts <- aae3 %>% 
  group_by(Games) %>% 
  summarise(Medals=sum(MedalGold),
            .groups='drop') %>% 
  rename(Gold=Medals)

silver_counts <- aae3 %>% 
  group_by(Games) %>% 
  summarise(Medals=sum(MedalSilver),
            .groups='drop') %>% 
  rename(Silver=Medals)

bronze_counts <- aae3 %>% 
  group_by(Games) %>% 
  summarise(Medals=sum(MedalBronze),
            .groups='drop') %>% 
  rename(Bronze=Medals)

noc_counts <- aae3 %>% 
  count(Games, NOC) %>% 
  count(Games) %>% 
  rename(noc_counts = n)

avg_weight <- avg_weight$Avg_Weight
avg_height <- avg_height$Avg_Height
ath_counts <- ath_counts$ath_counts
medal_counts <- medal_counts$Medals
gold_counts <- gold_counts$Gold
silver_counts <- silver_counts$Silver
bronze_counts <- bronze_counts$Bronze

good_data <- data.frame(noc_counts,medal_counts, gold_counts, silver_counts, bronze_counts, ath_counts, avg_height, avg_weight)

fancy_table <- datatable(good_data,
          rownames = TRUE,
          colnames=c("Games", "Countries","Medals","Gold", "Silver", "Bronze", "Athletes", "Average Height", "Average Weight")) %>% 
  formatStyle('Games',  color = 'white', 
              backgroundColor = 'blue', fontWeight = 'bold') %>% 
  formatStyle('medal_counts',
              background = styleColorBar(good_data$medal_counts, 'lightblue'),
              backgroundSize = '98% 88%',
              backgroudRepeat = 'no-repeat',
              backgroundPosition = 'center') %>%
  formatStyle('bronze_counts',
              background = styleColorBar(good_data$bronze_counts, 'brown'),
              backgroundSize = '98% 88%',
              backgroudRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle('gold_counts',
              background = styleColorBar(good_data$gold_counts, 'gold'),
              backgroundSize = '98% 88%',
              backgroudRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle('silver_counts',
              background = styleColorBar(good_data$silver_counts, 'grey'),
              backgroundSize = '98% 88%',
              backgroudRepeat = 'no-repeat',
              backgroundPosition = 'center')

fancy_table

This data table is a snapshot of some descriptive information of each Winter Olympics, including countries, medals, athletes, average height and weight. Those fundamental statistics would give any reader a clear sense of how each Winter Olympics differ. For example, there is a general increasing trend of everything, but 1980 Winter Olympics shows an oddly decrease in lots of indices in the table, so one might wonder if anything like boycotting happened. To make the data table more informative, one can also include host cities, number of sport events, gender ratio, etc.

Technical Details

The data comes in a reasonably clean Excel data set. If needed for your visualization, you can add visual drapery like flag icons, icons for sports, icons for medals etc. but your are certainly not obligated to do that.

Part of the your task will be transforming the dataset into a shape that allows you to plot what you want in ggplot2. For some plots, you will necessarily need to be selective in what to include and what to leave out.

Make sure to use at least three different types of graphs, e.g. line graphs, scatter, histograms, bar chats, dot plots, heat maps etc.

Submission

Please follow the instructions to submit your homework. The homework is due on Wednesday, February 16 at 5pm

Please stay honest!

Yes, the medal counts of the olympics have surely been analyzed before. If you do come across something, please no wholesale copying of other ideas. We are trying to practice and evaluate your abilities in using ggplot2 and data visualization not the ability to do internet searches. Also, this is an individually assigned exercise – please keep your solution to yourself.